{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# In-Depth Platform Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [Overview](#in-depth-overview)\n",
    "- [Collecting and Ingesting Data](#data-collection-and-ingestion)\n",
    "- [Exploring and Processing Data](#data-exploration-and-processing)\n",
    "- [Building and Training Models](#building-and-training-models)\n",
    "- [Deploying Models to Production](#deploying-models-to-production)\n",
    "- [Visualization, Monitoring, and Logging](#visualization-monitoring-and-logging)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"in-depth-overview\"></a>\n",
    "## Overview\n",
    "\n",
    "This document provides an in-depth overview of the Iguazio Data Science Platform (**\"the platform\"**) and how to use it to implement a full data science workflow.\n",
    "\n",
    "> **Note:** Start out by first reading the platform introduction and high-level overview in [**welcome.ipynb**](welcome.ipynb) (notebook) or [**README.md**](README.md) (Markdown).\n",
    "\n",
    "The platform uses [Kubernetes](https://kubernetes.io) (k8s) as the baseline cluster manager, and deploys various application microservices on top of Kubernetes to address different data science tasks.\n",
    "Most of the provided services support scaling out and GPU acceleration and have a secure and low-latency access to the platform's shared data store and file system, enabling high performance and scalability with maximum resource efficiency.\n",
    "\n",
    "The platform makes extensive use of [Nuclio serverless functions](https://github.com/nuclio/nuclio) to automate various tasks &mdash; such as data collection, extract-transform-load (ETL) processes, model serving, and batch jobs.\n",
    "Nuclio functions describe the code and include all the required resource definitions and configuration for running the code.\n",
    "The functions auto scale and can be versioned.\n",
    "The platform supports various methods for generating Nuclio functions &mdash; using the graphical dashboard, Docker, Git, or Jupyter Notebook &mdash; as demonstrated in the platform tutorials.\n",
    "\n",
    "The following sections detail how you can use the platform to implement all stages of a data science workflow from research to production."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"data-collection-and-ingestion\"></a>\n",
    "## Collecting and Ingesting Data\n",
    "\n",
    "There are many ways to collect and ingest data from various sources into the platform:\n",
    "\n",
    "- Streaming data in real time from sources such as Kafka, Kinesis, Azure Event Hubs, or Google Pub/Sub.\n",
    "- Loading data directly from external databases using an event-driven or periodic/scheduled implementation.\n",
    "  See the explanation and examples in the [**read-external-db**](data-ingestion-and-preparation/read-external-db.ipynb#ingest-from-external-db-to-no-sql-using-frames) tutorial.\n",
    "- Loading files (objects), in any format (for example, CSV, Parquet, JSON, or a binary image), from internal or external sources such as Amazon S3 or Hadoop.\n",
    "  See, for example, the [**file-access**](data-ingestion-and-preparation/file-access.ipynb) tutorial.\n",
    "- Importing time-series telemetry data using a Prometheus compatible scraping API.\n",
    "- Ingesting (writing) data directly into the system using RESTful AWS-like simple-object, streaming, or NoSQL APIs.\n",
    "  See the platform's [Web-API References](https://www.iguazio.com/docs/latest-release/reference/api-reference/web-apis/).\n",
    "\n",
    "For more information and examples of data collection, ingestion, and preparation with the platform, see the [**basic-data-ingestion-and-preparation**](data-ingestion-and-preparation/basic-data-ingestion-and-preparation.ipynb#gs-data-collection-and-ingestion) tutorial Jupyter notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"data-exploration-and-processing\"></a>\n",
    "## Exploring and Processing Data\n",
    "\n",
    "The platform includes a wide range of integrated open-source data query and exploration tools, including the following:\n",
    "\n",
    "- [Apache Spark](https://spark.apache.org/) data-processing engine &mdash; including the Spark SQL and Datasets, MLlib, R, and GraphX libraries &mdash; with real-time access to the platform's NoSQL data store and file system.\n",
    "  See the platform's [Spark APIs reference](https://www.iguazio.com/docs/latest-release/reference/api-reference/spark-apis/) and the examples in the [**spark-sql-analytics**](data-ingestion-and-preparation/spark-sql-analytics.ipynb) tutorial.\n",
    "- [Presto](https://prestosql.io/) distributed SQL query engine, which can be used to run interactive SQL queries over platform NoSQL tables or other object (file) data sources.\n",
    "  See the platform's [Presto reference](https://www.iguazio.com/docs/latest-release/reference/presto/).\n",
    "- [pandas](https://pandas.pydata.org/) Python analysis library, including structured DataFrames.\n",
    "- [Dask](https://dask.org/) parallel-computing Python library, including scaled pandas DataFrames.\n",
    "- [V3IO Frames](https://github.com/v3io/frames) &mdash; Iguazio's open-source data-access library, which provides a unified high-performance API for accessing NoSQL, stream, and time-series data in the platform's data store and features native integration with pandas and [NVIDIA RAPIDS](https://rapids.ai/).\n",
    "  See, for example, the [**frames**](data-ingestion-and-preparation/frames.ipynb) tutorial.\n",
    "- Built-in support for ML packages such as [scikit-learn](https://scikit-learn.org), [Pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html), [NumPy](http://www.numpy.org/), [PyTorch](https://pytorch.org/), and [TensorFlow](https://www.tensorflow.org/).\n",
    "\n",
    "All these tools are integrated with the platform's Jupyter Notebook service, allowing users to access the same data from Jupyter through different interfaces with minimal configuration overhead.\n",
    "Users can easily install additional Python packages by using the [Conda](https://anaconda.org/anaconda/conda) binary package and environment manager and the [pip](https://pip.pypa.io) Python package installer, which are both available as part of the Jupyter Notebook service.\n",
    "This design, coupled with the platform's unified data model, enables users to store and access data using different formats &mdash; such as NoSQL (\"key/value\"), time series, stream data, and files (simple objects) &mdash; and leverage different tools and APIs for accessing and manipulating the data, all from a single development environment (namely, Jupyter Notebook).\n",
    "\n",
    "> **Note:** You can deploy and manage application services, such as Spark and Jupyter Notebook, from the **Services** page of the platform dashboard.\n",
    "\n",
    "For more information and examples of data exploration with the platform, see the [**getting-started-tutorial**](getting-started-tutorial/getting-started-tutorial.ipynb) and the (data-ingestion-and-preparation/basic-data-ingestion-and-preparation.ipynb) tutorial Jupyter notebooks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"building-and-training-models\"></a>\n",
    "## Building and Training Models\n",
    "\n",
    "You can develop and test data science models in the platform's Jupyter Notebook service or in your preferred external editor.\n",
    "When your model is ready, you can train it in Jupyter Notebook or by using scalable cluster resources such as Nuclio functions, Dask, Spark ML, or Kubernetes jobs.\n",
    "You can find model-training examples in the following platform demos; for more information and download instructions, see [**welcome.ipynb**](welcome.ipynb#end-to-end-use-case-applications) (notebook) or [**README.md**](README.md#end-to-end-use-case-applications) (Markdown):\n",
    "\n",
    "- The NetOps demo demonstrates predictive infrastructure-monitoring using scikit-learn.\n",
    "- The image-classification demo demonstrates image recognition using TensorFlow and Horovod with MLRun.\n",
    "\n",
    "If you're are a beginner, you might find the following ML guide useful &mdash; [Machine Learning Algorithms In Layman's Terms](https://towardsdatascience.com/machine-learning-algorithms-in-laymans-terms-part-1-d0368d769a7b)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"experiment-tracking\"></a>\n",
    "### Experiment Tracking\n",
    "\n",
    "One of the most important and challenging areas of managing a data science environment is the ability to track experiments.\n",
    "Data scientists need a simple way to track and view current and historical experiments along with the metadata that is associated with each experiment. This capability is critical for comparing different runs, and eventually helps to determine the best model and configuration for production deployment.\n",
    "\n",
    "The platform leverages the open-source [MLRun](https://github.com/mlrun/mlrun) library to help tackle these challenges. You can find examples of using MLRun in the [MLRun demos](https://github.com/mlrun/demos/).\n",
    "For information about retrieving and updating local copies of the MLRun demos, see [**welcome.ipynb**](welcome.ipynb#end-to-end-use-case-applications) (notebook) or [**README.md**](README.md#end-to-end-use-case-applications) (Markdown)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"deploying-models-to-production\"></a>\n",
    "## Deploying Models to Production\n",
    "\n",
    "The platform allows you to easily deploy your models to production in a reproducible way by using the open-source Nuclio serverless framework.\n",
    "You provide Nuclio with code or Jupyter notebooks, resource definitions (such as CPU, memory, and GPU), environment variables, package or software dependencies, data links, and trigger information.\n",
    "Nuclio uses this information to automatically build the code, generate custom container images, and connect them to the relevant compute or data resources.\n",
    "The functions can be triggered by a wide variety of event sources, including the most commonly used streaming and messaging protocols, HTTP APIs, scheduled (cron) tasks, and batch jobs.\n",
    "\n",
    "Nuclio functions can be created from the platform dashboard or by using standard code IDEs, and can be deployed on your platform cluster.\n",
    "A convenient way to develop and deploy Nuclio functions is by using Jupyter Notebook and Python tools.\n",
    "For detailed information about Nuclio, visit the [Nuclio web site](https://nuclio.io/) and see the product [documentation](https://nuclio.io/docs/latest/).\n",
    "\n",
    "> **Note:** Nuclio functions aren't limited to model serving: they can automate data collection, serve custom APIs, build real-time feature vectors, drive triggers, and more.\n",
    "\n",
    "For an overview of Nuclio and how to develop, document, and deploy serverless Python Nuclio functions from Jupyter Notebook, see the [nuclio-jupyter documentation](https://github.com/nuclio/nuclio-jupyter/blob/master/README.md).\n",
    "You can also find examples in the platform tutorials.\n",
    "For example, the NetOps demo demonstrates how to deploy a network-operations model as a function; for more information about this demo and how to get it, see [**welcome.ipynb**](welcome.ipynb#end-to-end-use-case-applications) (notebook) or [**README.md**](README.md#end-to-end-use-case-applications) (Markdown)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"visualization-monitoring-and-logging\"></a>\n",
    "## Visualization, Monitoring, and Logging\n",
    "\n",
    "Data in the platform &mdash; including collected data, internal or external telemetry and logs, and program-output data &mdash; can be analyzed and visualized in different ways simultaneously.\n",
    "The platform supports multiple standard data analytics and visualization tools, including SQL, Prometheus, Grafana, and pandas.\n",
    "For example, you can plot or chart data within Jupyter Notebook using [Matplotlib](https://matplotlib.org/); use your favorite BI visualization tools, such as [Tableau](https://www.tableau.com), to query data in the platform over a Java database connectivity connector (JDBC); or build real-time dashboards in Grafana.\n",
    "\n",
    "The data analytics and visualization tools and services generate telemetry and log data that can be stored using the platform's time-series database (TSDB) service or by using external tools such as [Elasticsearch](https://www.elastic.co/products/elasticsearch).\n",
    "Platform users can easily instrument code and functions to collect various statistics or logs, and explore the collected data in real time.\n",
    "\n",
    "The [Grafana](https://grafana.com/grafana) open-source analytics and monitoring framework is natively integrated into the platform, allowing users to create dashboards that provide access to platform NoSQL tables and time-series databases from different dashboard widgets.\n",
    "You can also create Grafana dashboards programmatically (for example, from Jupyter Notebook) using wizard scripts.\n",
    "For information on how to create Grafana dashboards to monitor and visualize data in the platform, see [Adding a Custom Grafana Dashboard](https://www.iguazio.com/docs/latest-release/tutorials/getting-started/grafana-dashboards/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
